Abstract

False friends are pairs of words in two languages that are
perceived as similar, but have different meanings, e.g., Gift
in German means poison in English. In this paper, we present
several unsupervised algorithms for acquiring such pairs from
a sentence-aligned bi-text. First, we try different ways of
exploiting simple statistics about monolingual word
occurrences and cross-lingual word co-occurrences in the
bi-text. Second, using methods from statistical machine
translation, we induce word alignments in an unsupervised way,
from which we estimate lexical translation probabilities,
which we use to measure cross-lingual semantic similarity.
Third, we experiment with a semantic similarity measure that
uses the Web as a corpus to extract local contexts from text
snippets returned by a search engine, and a bilingual glossary
of known word translation pairs, used as "bridges". Finally,
all measures are combined and applied to the task of
identifying likely false friends. The evaluation for Russian
and Bulgarian shows a significant improvement over previously
proposed algorithms.
1 Introduction

Words in two languages that are orthographically and/or
phonetically similar are often perceived as mutual
translations, which could be wrong in some contexts. Such
words are known as cognates¹ when they are mutual translations
in all contexts, partial cognates when they are mutual
translations in some contexts but not in others, and false
friends when they are not mutual translations in any context.

For example, the Bulgarian слънце (slynce) and the Russian
солнце (solnce) are cognates, both meaning sun. However, the
Bulgarian син (sin) and the Russian сын (syn) are only partial
cognates: while they can both mean son in some contexts, the
Bulgarian word can also mean blue in others. Finally, the
Bulgarian бистрота (bistrota) and the Russian быстрота
(bystrota) are false friends meaning clearness and quickness,
respectively.

False friends are important not only for foreign language
learning, but also for various natural language processing
(NLP) tasks such as statistical machine translation, word
alignment, automated translation quality control, etc.

In this paper, we propose an unsupervised approach to the task
of extracting pairs of false friends from a sentence-aligned
corpus (a bi-text). While we experiment with Bulgarian and
Russian, our general method is language-independent.

The remainder of the paper is organized as follows: Section 2
provides an overview of the related work. Section 3 presents
our method. Section 4 lists the resources we are using.
Sections 5 and 6 describe the experiments and discuss the
evaluation results. Section 7 concludes and suggests
directions for future work.



*Also: Department of Computer Science, National University of
Singapore, 13 Computing Drive, Singapore 117417,
nakov@comp.nus.edu.sg

¹ We should note that linguists define cognates as words
derived from a common root regardless of whether they differ
in meaning or not. For example, the Electronic Glossary of
Linguistic Terms gives the following definition [3]:

    Two words (or other structures) in related languages are
    cognate if they come from the same original word (or other
    structure). Generally cognates will have similar, though
    often not identical, phonological and semantic structures
    (sounds and meanings). For instance, Latin tu, Spanish tú,
    Greek sú, German du, and English thou are all cognates;
    all mean 'second person singular', but they differ in form
    and in whether they mean specifically 'familiar'
    (non-honorific).

Following previous researchers in computational linguistics
[2, 24, 25], we will adopt a simplified definition which
ignores origin, defining cognates as words in different
languages that are mutual translations and have a similar
orthography.



2 Related work 

Previous work on extracting false friends from text can 
be divided into the following categories: (1) methods 
for measuring orthographic and phonetic similarity, 
(2) methods for identifying cognates and false friends 
from parallel bi-texts, and (3) semantic methods for 
distinguishing between cognates and false friends. 

Most of the research in the last decade has focused on
orthographic methods for cognate identification that do not
try to distinguish cognates from false friends. While
traditional orthographic similarity measures such as the
longest common subsequence ratio and the minimum edit distance
have evolved over the years towards machine learning
approaches for identifying cross-lingual orthographic
transformation patterns [2, 26], recent research has shown
interest in using semantic evidence as well [26, 27].

However, little research has been conducted on extracting
false friends from parallel bi-texts. Very few authors have
proposed such algorithms [31], while most research has focused
on word-to-word alignment [39] and on extracting bilingual
lexicons with applications to identifying cognates [9].

2.1 Orthographic/phonetic similarity 

In this subsection, we describe some relevant methods 
based on orthographic and phonetic similarity. 

2.1.1 Orthographic similarity 

The first methods proposed for identifying cognates 
were based on measuring orthographic similarity. For 
languages sharing the same alphabet, classical ap- 
proaches include minimum edit distance [22], longest 
common subsequence ratio [25], and variants of the 
Dice coefficient measuring overlap at the level of char- 
acter bigrams [5]. 

The Levenshtein distance or minimum edit distance (MED) is
defined as the minimum number of INSERT, REPLACE, and DELETE
operations at the character level needed to transform one
string into another [22]. For example, the MED between the
Bulgarian word първият ('the first') and the Russian word
первый ('first') is 4: there are three REPLACE operations,
namely, ъ → е, и → ы, and я → й, and one DELETE operation (to
remove т). To be used as a similarity measure, MED is
typically normalized by dividing it by the length of the
longer word and subtracting the result from 1; this normalized
measure is known as the minimum edit distance ratio (MEDR):

$$\mathrm{MEDR}(\text{първият}, \text{первый}) = 1 - 4/7 \approx 0.43$$
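The MED and MEDR computation described above can be sketched
in a few lines of Python. This is a minimal illustration with
unit costs for all three operations; the function names med
and medr are chosen here for the example and do not come from
the cited work.

```python
def med(a: str, b: str) -> int:
    """Minimum edit distance with unit-cost INSERT, REPLACE, and DELETE."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # DELETE ca
                curr[j - 1] + 1,           # INSERT cb
                prev[j - 1] + (ca != cb),  # REPLACE (free when letters match)
            ))
        prev = curr
    return prev[-1]


def medr(a: str, b: str) -> float:
    """MED normalized by the longer word and turned into a similarity."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - med(a, b) / longest


print(med("първият", "первый"))             # 4
print(round(medr("първият", "первый"), 2))  # 0.43
```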

The longest common subsequence ratio (LCSR) [25] is another
classic normalized orthographic similarity measure. It is
calculated as the ratio of the length of the longest common
subsequence of the two words and the length of the longer
word. For example, LCS(първият, первый) = прв, and thus we
have:

$$\mathrm{LCSR}(\text{първият}, \text{первый}) = 3/7 \approx 0.43$$
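As a small illustration of the definition above, LCSR can be
computed with the standard dynamic program for the longest
common subsequence. The function names are chosen for this
sketch only, and returning 1.0 for two empty strings is an
assumption made here.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)   # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCS length divided by the length of the longer word."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest


print(lcs_length("първият", "первый"))      # 3
print(round(lcsr("първият", "первый"), 2))  # 0.43
```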



Other approaches to measuring orthographic similarity between
two words have been proposed by [1], who calculate the Dice
coefficient for character bigrams. Their idea is further
extended by [5], who used a weighted version of the Dice
coefficient, and by [19], who proposed a generalized n-gram
measure.
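For illustration, the plain (unweighted) Dice coefficient over
character bigrams can be sketched as follows. Treating the
bigrams of a word as a multiset is an assumption of this
sketch; the cited papers may define the overlap differently,
and the weighted and n-gram variants are not shown.

```python
from collections import Counter


def char_bigrams(word: str) -> Counter:
    """Multiset of adjacent character pairs in the word."""
    return Counter(word[i:i + 2] for i in range(len(word) - 1))


def dice_bigrams(a: str, b: str) -> float:
    """Dice coefficient: 2 * |shared bigrams| / (|bigrams(a)| + |bigrams(b)|)."""
    x, y = char_bigrams(a), char_bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 0.0  # both words are shorter than two characters
    shared = sum((x & y).values())  # multiset intersection
    return 2.0 * shared / total


# "night" and "nacht" share only the bigram "ht": 2 * 1 / (4 + 4)
print(dice_bigrams("night", "nacht"))  # 0.25
```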

2.1.2 Phonetic similarity 

Phonetic similarity measures the degree to which two words
sound alike. Unlike orthographic similarity, it operates on
sounds rather than letter sequences.

Although a number of specialized algorithms for measuring
phonetic similarity have been proposed [11, 17], it can also
be measured using orthographic similarity methods after the
words have been phonetically transcribed. This approach also
works for languages that use different alphabets.

2.1.3 Using transformation rules 

Recently, some researchers have proposed to apply
transformation rules that reflect typical cross-lingual
transformation patterns observed for the target pair of
languages before measuring orthographic similarity. This is a
good idea when the languages do not use exactly the same
alphabet or the same spelling systems. Of course, such
substitutions do not have to be limited to single letters and
could be defined for letter sequences such as syllables,
endings, and prefixes. For example, [16] use manually
constructed transformation rules between German and English
for expanding a list of cognates: e.g., replacing the German
letters k and z by the English c, and changing the German
ending -tät to the English -ty.
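A toy sketch of such rule-based normalization is given below.
It encodes only the German-English substitutions mentioned
above; the rule order, the lowercasing, and the function name
apply_rules are assumptions made for this illustration and are
not taken from [16].

```python
import re

# (pattern, replacement) pairs; the ending rule is applied first so that
# it sees the original German suffix.
RULES = [
    (re.compile(r"tät$"), "ty"),  # German ending -tät -> English -ty
    (re.compile(r"[kz]"), "c"),   # German k and z -> English c
]


def apply_rules(word: str) -> str:
    """Rewrite a German word towards English spelling before comparison."""
    word = word.lower()
    for pattern, replacement in RULES:
        word = pattern.sub(replacement, word)
    return word


print(apply_rules("Elektrizität"))  # electricity
print(apply_rules("Universität"))   # university
```

After this step, an orthographic measure such as MEDR or LCSR
would be applied to the rewritten German word and the English
candidate.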

Other researchers have tried to automatically learn
cross-lingual transformation rules that reflect regular
phonetic changes between two languages. For example, [27] do
that using MED and positive examples provided as a list of
known cognates. Unlike them, [2] use both positive and
negative examples to learn weights on substring pairings in
order to better identify related substring transformations.
Starting with MED, they first obtain an alignment at the
letter level. They then extract corresponding substrings that
are consistent with that alignment, which in turn are used
with a support vector machine (SVM) classifier to distinguish
between cognates and false friends.

2.2 Using parallel bi-texts 

There is little research on extracting false friends from 
text corpora directly. Most methods first extract can- 
didate cognates and false friends using some measure 
of orthographic or phonetic similarity, and then try 
to distinguish between cognates and false friends in a 
second step [26]. 

Fung [9] extracted semantically similar cross-lingual 
word pairs from parallel and comparable corpora using 
binary co-occurrence vectors - one for each side of the 
bi-text. 

Brew and McKelvie [5] used sentence alignment to 
extract cognates and false friends from a bi-text using 
co-occurrence statistics in the aligned sentences. 



Nakov and Pacovski [31] extracted false friends from a
paragraph-aligned Bulgarian-Macedonian² bi-text, assuming that
false friends are unlikely to co-occur in paragraphs that are
translations of each other, while cognates tend to do so.
Several formulas formalizing this assumption have been
proposed and evaluated.

2.3 Semantic approaches 

2.3.1 Corpus-based approaches 

There has been a lot of research during the last decade 
on measuring semantic similarity with applications to 
finding cognates and false friends. Most of the pro- 
posed approaches are based on the distributional hy- 
pothesis, which states that words occurring in similar 
contexts tend to be semantically similar [12]. This hy- 
pothesis is typically operationalized using the vector- 
space model [38], where vectors are built using words 
from the local context of the target word as coordi- 
nates and their frequencies as values for these coordi- 
nates. For example, Nakov & al. [33] defined the local 
context as a window of a certain size around the target 
word, while other researchers have limited the context 
to words in a particular syntactic relationship with the 
target word, e.g., direct object of a verb [7, 23, 28]. 
These vectors are typically compared using the cosine 
of the angle between them as in [33], but other simi- 
larity measures such as the Dice coefficient have been 
used as well [28]. 

Kondrak [18] proposed an algorithm for measur- 
ing semantic similarity using WordNet [8]. He used 
the following eight semantic similarity levels as binary 
features: gloss identity, keyword identity, gloss syn- 
onymy, keyword synonymy, gloss hypernymy, keyword 
hypernymy, gloss meronymy, and keyword meronymy. 
These features were combined with a measure of pho- 
netic similarity and used in a naive Bayes classifier to 
distinguish between cognates and non-cognates. 

Mitkov & al. [26] proposed several methods for mea- 
suring the semantic similarity between orthographi- 
cally similar pairs of words, which were used to distin- 
guish between cognates and false friends. Their first 
method uses comparable corpora in the two target lan- 
guages and relies on distributional similarity: given a 
cross-lingual pair of words to compare, a set of the 
most similar words to each of the two targets is col- 
lected from the respective monolingual corpus. The 
semantic similarity is calculated as the Dice coefficient 
between these two sets, using a bilingual glossary to 
check whether two words can be translations of each 
other. The method is further extended to use taxonomic data
from EuroWordNet [40] when available.
The second method extracts co-occurrence statistics 
for each of the two words of interest from a respec- 
tive monolingual corpus using a dependency parser. 
In particular, verbs are used as distributional features 
when comparing nouns. Semantic sets are thus cre- 
ated, and the similarity between them is measured us- 
ing the Dice coefficient and a bilingual glossary. 



2.3.2 Web-based approaches 

The idea of using the Web as a corpus is getting in- 
creasingly popular and has been applied to various 
problems. See [15] for an overview. See also [20, 21] 
for some interesting applications, and [14, 29] for a 
discussion on some issues. 

Many researchers have used the Web to identify cognates and
false friends. Some used Web search engines as a proxy for
n-gram frequency, i.e., to estimate how many times a word or a
phrase occurs on the Web [13], whereas others directly
retrieved contexts from the returned text snippets [33]. There
have also been some combined approaches, e.g., that of
Bollegala & al. [4], who further learned lexico-syntactic
templates for semantically related and unrelated words using
WordNet, which were used for extracting information from the
text snippets returned by the search engine.



3 Method 

We propose a method for extracting false friends from 
a bi-text that combines statistical and semantic ev- 
idence in a two-step process: (1) we extract cross- 
lingual pairs of orthographically similar words, and 
(2) we identify which of them are false friends. 

For the sake of definiteness, below we will assume that our
objective is to extract pairs of false friends between
Bulgarian and Russian.

3.1 Finding candidates 

We look for cross-lingual pairs of words that are perceived as
similar and thus could be cognates or false friends. First, we
extract from the bi-text³ all cross-lingual Bulgarian-Russian
word pairs (w_bg, w_ru). We then measure the orthographic
similarity between w_bg and w_ru, and we accept the pair as a
candidate only if that similarity is above a pre-specified
threshold. In the process, we ignore part of speech, gender,
number, definiteness, and case, which are expressed as
inflections in both Bulgarian and Russian.

We measure the orthographic similarity using the modified
minimum edit distance ratio (MMEDR) algorithm [30]. First,
some Bulgarian-specific letter sequences are replaced by
Russian-specific ones. Then, a weighted minimum edit distance
that is specific to Bulgarian and Russian is calculated
between the resulting strings. Finally, that distance is
normalized by dividing it by the length of the longer word,
and the result is subtracted from 1 to obtain a similarity
measure.

Let us consider for example the Bulgarian word първият ('the
first') and the Russian one первый ('first'). First, they will
be normalized to първи and перви, respectively. Then, a single
vowel-to-vowel REPLACE operation will be performed with the
cost of 0.5. Finally, the result will be normalized and
subtracted from one:⁴

$$\mathrm{MMEDR}(\text{първият}, \text{первый}) = 1 - 0.5/7 \approx 0.93$$



² There is a heated linguistic and political debate about
whether Macedonian represents a separate language or is a
regional literary form of Bulgarian. Since no clear criteria
exist for distinguishing a dialect from a language, linguists
remain divided on that issue. Politically, Macedonian remains
unrecognized as a language by Bulgaria and Greece.

³ We extract pairs from the whole bi-text ignoring sentence
alignment, i.e., the words in a pair do not have to come from
corresponding sentences.

⁴ Compare this result to MEDR and LCSR, which severely
underestimate the similarity: they are both 0.43 for първият
and первый.

Although MMEDR is an orthographic approach, it 
reflects phonetics to some extent since it uses trans- 
formation rules and edit distance weights that are mo- 
tivated by regular phonetic changes between Bulgarian 
and Russian as reflected in the spelling systems of the 
two languages. See [30] for more details. 
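The weighted edit distance step of the worked example above
can be sketched as follows. This is not the MMEDR algorithm of
[30]: the only weight used is the 0.5 vowel-to-vowel REPLACE
cost mentioned in the example, all other operations are
assumed to cost 1, the vowel inventory is an assumption, and
the letter-sequence normalization is not implemented (the
normalized forms are passed in by hand).

```python
# Assumed vowel inventory for Bulgarian and Russian (illustrative only).
VOWELS = set("аеёиоуъыэюя")


def weighted_med(a: str, b: str, vowel_cost: float = 0.5) -> float:
    """Edit distance where replacing one vowel by another costs vowel_cost."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                replace = 0.0
            elif ca in VOWELS and cb in VOWELS:
                replace = vowel_cost
            else:
                replace = 1.0
            curr.append(min(prev[j] + 1.0,           # DELETE
                            curr[j - 1] + 1.0,       # INSERT
                            prev[j - 1] + replace))  # REPLACE or match
        prev = curr
    return prev[-1]


def mmedr_sketch(norm_bg: str, norm_ru: str, orig_bg: str, orig_ru: str) -> float:
    """Normalize by the longer original word, as in the worked example."""
    longest = max(len(orig_bg), len(orig_ru))
    return 1.0 - weighted_med(norm_bg, norm_ru) / longest


print(round(mmedr_sketch("първи", "перви", "първият", "первый"), 2))  # 0.93
```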

While the MMEDR algorithm could be improved to 
learn transformation rules automatically, e.g., follow- 
ing [2, 26, 27], this is out of the scope of the present 
work. Our main focus below will be on distinguishing 
between cognates and false friends, which is a much 
more challenging task. 

3.2 Identifying false friends 

Once we have collected a list of orthographically and/or
phonetically similar cross-lingual word pairs, we need to
decide which of them are false friends, i.e., distinguish
between false friends and cognates.⁵ We use several approaches
for this purpose.

3.2.1 Sentence-level co-occurrences 

First, we use a statistical approach, based on statistics
about word occurrences and co-occurrences in a bi-text. The
idea is that cognates tend to co-occur in corresponding
sentences while false friends do not. Following [31], we make
use of the following statistics:

• #(w_bg) - the number of Bulgarian sentences in the bi-text
containing the word w_bg;

• #(w_ru) - the number of Russian sentences in the bi-text
containing the word w_ru;

• #(w_bg, w_ru) - the number of corresponding sentence pairs
in the bi-text containing the word w_bg on the Bulgarian side
and the word w_ru on the Russian side.
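The three statistics listed above can be collected in a single
pass over the bi-text. The sketch below assumes that the
bi-text is available as a list of (Bulgarian tokens, Russian
tokens) pairs, one per aligned sentence pair; this input
format and the function name are assumptions of the example.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(bitext):
    """Return (#(w_bg), #(w_ru), #(w_bg, w_ru)) as three Counters.

    Each word is counted at most once per sentence, since the statistics
    are numbers of sentences (or sentence pairs), not numbers of tokens.
    """
    bg_count, ru_count, pair_count = Counter(), Counter(), Counter()
    for bg_tokens, ru_tokens in bitext:
        bg_types, ru_types = set(bg_tokens), set(ru_tokens)
        bg_count.update(bg_types)
        ru_count.update(ru_types)
        pair_count.update(product(bg_types, ru_types))
    return bg_count, ru_count, pair_count


# Toy bi-text with made-up tokens.
toy = [(["a", "b", "a"], ["x", "y"]),
       (["a"], ["y"])]
bg, ru, both = cooccurrence_counts(toy)
print(bg["a"], ru["y"], both[("a", "y")])  # 2 2 2
```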

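Nakov & Pacovski [31] tried various combinations of these
statistics; in their experiments, the best-performing formula
was F6, whose numerator is #(w_bg, w_ru) + 1:

$$F_6(w_{bg}, w_{ru}) = \frac{\#(w_{bg}, w_{ru}) + 1}{\text{[denominator missing in the source; see [31]]}}$$
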

Now, note that we have the following inequalities:

$$\#(w_{bg}) \geq \#(w_{bg}, w_{ru})$$
$$\#(w_{ru}) \geq \#(w_{bg}, w_{ru})$$

Thus, having a high number of co-occurrences #(w_bg, w_ru)
should increase the probability that the words w_bg and w_ru
are cognates. At the same time, a big difference between
#(w_bg) and #(w_ru) should increase the likelihood of them
being false friends. Based on these observations, we propose
the following extra formulas (E1 and E2):

$$E_1(w_{bg}, w_{ru}) = \frac{\#(w_{bg}, w_{ru}) + 1}{(\#(w_{bg}) + 1)(\#(w_{ru}) + 1)}$$

$$E_2(w_{bg}, w_{ru}) = \frac{\#(w_{bg}, w_{ru}) + 1}{P \times Q}$$

where

$$P = \#(w_{bg}) - \#(w_{bg}, w_{ru}) + 1$$
$$Q = \#(w_{ru}) - \#(w_{bg}, w_{ru}) + 1$$



⁵ Since our ultimate objective is to extract pairs of false
friends, we do not need to distinguish between true and
partial cognates; we will thus use the term cognates to refer
to both.


Finally, unlike [31], we perform lemmatization be- 
fore calculating the above statistics - Bulgarian and 
Russian are highly inflectional languages, and words 
are expected to have several inflected forms in the bi- 
text. 
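As a small numeric illustration, the formulas E1 and E2 can be
written directly as functions of the three sentence counts.
The sketch follows the formulas exactly as they appear in this
text; the argument names are chosen for the example, and the
counts are assumed to be computed on lemmata.

```python
def e1(n_bg: int, n_ru: int, n_both: int) -> float:
    """E1: co-occurrences relative to the product of the occurrence counts."""
    return (n_both + 1) / ((n_bg + 1) * (n_ru + 1))


def e2(n_bg: int, n_ru: int, n_both: int) -> float:
    """E2: co-occurrences relative to the occurrences without co-occurrence."""
    if n_both > min(n_bg, n_ru):
        raise ValueError("co-occurrences cannot exceed individual occurrences")
    p = n_bg - n_both + 1
    q = n_ru - n_both + 1
    return (n_both + 1) / (p * q)


# Words that always co-occur score high; words that never co-occur score low.
print(e1(3, 3, 3), e2(3, 3, 3))    # 0.25 4.0
print(e1(10, 2, 0), e2(10, 2, 0))  # both equal 1/33
```

Low values of either score suggest false friends, while high
values suggest cognates.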

3.2.2 Word alignments 

Our next information source is lexical probabilities for
cross-lingual word pairs: they should be high for cognates and
low for false friends. Such probabilities can be easily
estimated from word alignments, which in turn can be obtained
using techniques from statistical machine translation.

We start by tokenizing and lowercasing both sides of the
training bi-text. We then build separate directed word
alignments for Bulgarian→Russian and Russian→Bulgarian using
IBM model 4 [6], and we combine them using the intersect+grow
heuristic described in [34]. Using the resulting undirected
alignment, we estimate lexical translation probabilities
Pr(w_bg | w_ru) and Pr(w_ru | w_bg) for all Bulgarian-Russian
word pairs that co-occur in aligned sentences in the bi-text.
Finally, we define a cross-lingual semantic similarity measure
as follows:

$$\mathrm{lex}(w_{bg}, w_{ru}) = \Pr(w_{bg} \mid w_{ru}) + \Pr(w_{ru} \mid w_{bg})$$


Note that the above definition has an important 
drawback: it is zero for all words that do not co-occur 
in corresponding sentences in the bi-text. Thus, we 
will never use lex(wbg,Wru) alone but only in combi- 
nation with other measures. 
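One simple way to obtain such probabilities, shown here only
as a sketch, is relative-frequency estimation over the links
of the symmetrized alignment. The text does not state which
estimator was used, so this choice is an assumption, as are
the input format (a list of aligned word pairs) and the
function name. The score combines the two directions as in the
definition of lex given in this text.

```python
from collections import Counter


def make_lex(links):
    """Build lex(w_bg, w_ru) from aligned (w_bg, w_ru) word pairs."""
    links = list(links)
    pair_count = Counter(links)
    bg_count = Counter(w_bg for w_bg, _ in links)
    ru_count = Counter(w_ru for _, w_ru in links)

    def lex(w_bg: str, w_ru: str) -> float:
        n = pair_count[(w_bg, w_ru)]
        if n == 0:
            return 0.0  # never aligned: the measure is zero
        p_bg_given_ru = n / ru_count[w_ru]
        p_ru_given_bg = n / bg_count[w_bg]
        return p_bg_given_ru + p_ru_given_bg

    return lex


# Toy alignment links with made-up tokens.
lex = make_lex([("a", "x"), ("a", "x"), ("a", "y"), ("b", "x")])
print(lex("a", "x"))  # 2/3 + 2/3
print(lex("b", "y"))  # 0.0
```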

3.2.3 Web similarity 

Next, we use an algorithm described in [33], which, given a
Russian word w_ru and a Bulgarian word w_bg to be compared,
measures the semantic similarity between them using the Web as
a corpus and a glossary G of known Russian-Bulgarian
translation pairs, used as "bridges". The basic idea is that
if two words are translations, then the words in their
respective local contexts should be translations as well. The
idea is formalized using the Web as a corpus, a glossary of
known word translations serving as cross-linguistic "bridges",
and the vector space model. We measure the semantic similarity
between a Bulgarian and a Russian word, w_bg and w_ru, by
constructing corresponding contextual semantic vectors V_bg
and V_ru, translating V_ru into Bulgarian, and comparing it to
V_bg.



The process of building V_bg starts with a query to Google
limited to Bulgarian pages for the target word w_bg. We
collect the resulting text snippets (up to 1,000), and we
remove all stop words - prepositions, pronouns, conjunctions,
interjections and some adverbs. We then identify the
occurrences of w_bg, and we extract the words from its local
context - three words on either side. We filter out the words
that do not appear on the Bulgarian side of G. We do this for
all text snippets. Finally, for each retained word, we
calculate the number of times it has been extracted, thus
producing a frequency vector V_bg. We repeat the procedure for
w_ru to obtain a Russian frequency vector V_ru, which is then
"translated" into Bulgarian by replacing each Russian word
with its translation(s) in G, retaining the co-occurrence
frequencies. In case there are multiple Bulgarian translations
for some Russian word, we distribute the corresponding
frequency equally among them, and in case of multiple Russian
words with the same Bulgarian translation, we sum up the
corresponding frequencies. As a result, we end up with a
Bulgarian vector V_ru→bg for the Russian word w_ru. Finally,
we calculate the semantic similarity between w_bg and w_ru as
the cosine of the angle between their corresponding Bulgarian
vectors, V_bg and V_ru→bg, as follows:

$$\mathrm{sim}(w_{bg}, w_{ru}) = \frac{V_{bg} \cdot V_{ru \to bg}}{|V_{bg}| \times |V_{ru \to bg}|}$$
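The vector construction, translation, and comparison steps
described above can be sketched as follows. The sketch assumes
that the snippets have already been retrieved and tokenized,
that the glossary is a dictionary from a Russian word to a
list of Bulgarian translations, and that all function and
parameter names are hypothetical; querying the search engine
is not shown.

```python
import math
from collections import Counter, defaultdict


def context_vector(snippets, target, vocabulary, stop_words, window=3):
    """Count glossary words within `window` words of each target occurrence."""
    vector = Counter()
    for tokens in snippets:
        tokens = [t for t in tokens if t not in stop_words]
        for i, token in enumerate(tokens):
            if token != target:
                continue
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            vector.update(t for t in context if t in vocabulary)
    return vector


def translate_vector(ru_vector, glossary):
    """Map a Russian frequency vector into Bulgarian via the glossary."""
    bg_vector = defaultdict(float)
    for word, freq in ru_vector.items():
        translations = glossary.get(word, [])
        for translation in translations:
            # split the frequency equally; shared translations are summed
            bg_vector[translation] += freq / len(translations)
    return dict(bg_vector)


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(key, 0.0) for key, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy data with made-up tokens.
v_bg = context_vector([["the", "red", "sun", "is", "hot"]], "sun",
                      vocabulary={"red", "hot"}, stop_words={"the", "is"})
v_ru_bg = translate_vector({"r1": 2, "r2": 1}, {"r1": ["red", "hot"], "r2": ["red"]})
print(dict(v_bg), v_ru_bg)
print(round(cosine(v_bg, v_ru_bg), 3))
```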



3.2.4 Combined approach

While all three approaches described above - sentence-level
co-occurrences, word alignments, and Web similarity - are
useful for distinguishing between cognates and false friends,
each of them has some weaknesses.

Using sentence-level co-occurrences and word alignments is a
good idea when the statistics for the target words are
reliable, but both work poorly for infrequent words.
Unfortunately, most words (and thus most word pairs) will be
rare for any (bi-)text, according to Zipf's law [41].

Data sparsity is less of an issue for the Web similarity
approach, which uses statistics derived from the largest
existing corpus - the Web. Still, while being quite reliable
for unrelated words, it sometimes assigns very low scores to
highly related word pairs. The problem stems from it relying
on a commercial search engine like Google, which only returns
up to 1,000 results per query and ranks news, e-commerce, and
blog sites too highly, which introduces a bias on the local
contexts of the target words. Moreover, some geographical and
cultural notions have different contexts on the Web for
Bulgarian and Russian, despite being very related otherwise,
e.g., person names and goods used in e-commerce (due to
different popular brands in different countries).

A natural way to overcome these issues is to combine all three
approaches, e.g., by taking the average of the similarity each
of them predicts. As we will see below, this is indeed quite a
valuable strategy.



4 Resources

For the purpose of our experiments, we used the following
resources: a bi-text, lemmatization dictionaries, a bilingual
glossary, and the Web as a corpus.

4.1 Bi-text

We used the first seven chapters of the Russian novel Lord of
the World by Alexander Belyaev and its Bulgarian translation
consisting of 759 parallel sentences. The text has been
sentence aligned automatically using the alignment tool
MARK ALISTeR [36], which is based on the Gale-Church algorithm
[10].

4.2 Morphological dictionaries

We used two large monolingual morphological dictionaries for
Bulgarian and Russian created at the Linguistic Modeling
Department of the Institute for Parallel Processing in the
Bulgarian Academy of Sciences [35]. The Bulgarian dictionary
contains about 1M wordforms and 70K lemmata, and the Russian
one contains about 1.5M wordforms and 100K lemmata.

4.3 Bilingual glossary

We built a bilingual glossary by adapting an online
Russian-Bulgarian dictionary. First, we removed all multi-word
expressions. Then we combined each Russian word with each of
its Bulgarian translations - due to polysemy/homonymy some
words had multiple translations. We thus obtained a glossary
of 59,583 pairs of words that are translations of each other.




4.4 Web as a corpus 

In our experiments, we performed searches in Google 
for 557 Bulgarian and for 550 Russian wordforms, and 
we collected as many as possible (up to 1,000) page 
titles and text snippets from the search results. 



5 Experiments and evaluation 

We performed several experiments in order to evaluate the
above-described approaches - both individually and in various
combinations.⁶

First, we extracted cross-lingual pairs of orthographically
similar words using the MMEDR algorithm with a threshold of
0.90. This yielded 612 pairs, each of which was judged by a
linguist⁷ as being a pair of false friends, partial cognates,
or true cognates. There were 35 false friends (5.72%), 67
partial cognates (10.95%), and 510 true cognates (83.33%).

Then, we applied different algorithms to distinguish which of
the 612 pairs are false friends and which are cognates
(partial or true). Each algorithm assigned a similarity score
to each pair and then the pairs were sorted by that score in
ascending order. Ideally, the false friends, having a very low
similarity, should appear at the top of the list, followed by
the cognates.



⁶ Some of the experiments have been published in [32].
⁷ The linguist judged the examples as cognates or false
friends by consulting Bulgarian-Russian bilingual
dictionaries.



Following [2] and [33], the resulting lists were evalu- 
ated using 11-pt average precision, which is well-known 
in information retrieval [37]; it averages the precision 
at 11 points corresponding to recall of 0%, 10%, 20%, 
. . ., 100%, respectively. 
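For concreteness, 11-pt average precision over a ranked list
can be sketched as below. The sketch uses the interpolated
variant that is common in information retrieval (the precision
at a recall level is the best precision achieved at that
recall or higher); whether the reported figures use exactly
this interpolation is an assumption. The input is a list of
booleans, True for a false friend, in ranked order.

```python
def eleven_point_average_precision(ranked_labels):
    """Average interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        raise ValueError("the ranked list contains no relevant items")

    # (recall, precision) measured right after each relevant item
    points = []
    hits = 0
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))

    interpolated = []
    for level in (i / 10 for i in range(11)):
        reachable = [p for r, p in points if r >= level - 1e-12]
        interpolated.append(max(reachable) if reachable else 0.0)
    return sum(interpolated) / 11


print(eleven_point_average_precision([True, True, False, False]))  # 1.0
print(eleven_point_average_precision([False, True]))               # 0.5
```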

We performed the following experiments: 

• BASELINE - word pairs in alphabetical order: it behaves
nearly like a random function and is used as a baseline;

• COOC - the sentence-level co-occurrence algorithm of [31]
with their formula F6;

• COOC+L - the algorithm COOC with lemmatization;

• COOC+E1 - the algorithm COOC with the formula E1;

• COOC+E1+L - the algorithm COOC with the formula E1 and
lemmatization;

• COOC+E2 - the algorithm COOC with the formula E2;

• COOC+E2+L - the algorithm COOC with the formula E2 and
lemmatization;

• WEB+L - the Web-based semantic similarity with
lemmatization;

• WEB+COOC+L - averaging the values of WEB+L and COOC+L;

• WEB+E1+L - averaging the values of WEB+L and E1+L;

• WEB+E2+L - averaging the values of WEB+L and E2+L;

• WEB+SMT+L - the algorithm WEB+L combined with the
statistical machine translation similarity measure by
averaging the values of WEB+L and the estimated lexical
translation probability;

• COOC+SMT+L - the algorithm COOC+L combined with the machine
translation similarity by averaging the similarity scores;

• E1+SMT+L - the algorithm E1+L combined with the machine
translation similarity by averaging the similarity scores;

• E2+SMT+L - the algorithm E2+L combined with the machine
translation similarity by averaging the similarity scores;

• WEB+COOC+SMT+L - averaging WEB+L, COOC+L, and the machine
translation similarity;

• WEB+E1+SMT+L - averaging WEB+L, E1+L, and the machine
translation similarity;

• WEB+E2+SMT+L - averaging WEB+L, E2+L, and the machine
translation similarity.

The results are shown in Table 1. We also tried 
several ways of weighting the statistical and semantic 
components in some of the above algorithms, but this 
had little impact on the results. 
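The averaging used by the combined variants can be sketched as
follows. The score dictionaries, their values, and the
function names are hypothetical, and the sketch assumes that
the individual scores are on comparable scales; pairs are
listed from the lowest combined similarity upwards, so that
likely false friends come first.

```python
def combine_by_average(*score_maps):
    """Average several {pair: similarity} maps over the pairs they share."""
    pairs = set(score_maps[0])
    for scores in score_maps[1:]:
        pairs &= set(scores)
    return {pair: sum(scores[pair] for scores in score_maps) / len(score_maps)
            for pair in pairs}


def rank_false_friend_candidates(scores):
    """Order pairs from lowest to highest similarity (ties broken by pair)."""
    return sorted(scores, key=lambda pair: (scores[pair], pair))


# Hypothetical scores for two candidate pairs.
web = {("bg1", "ru1"): 0.9, ("bg2", "ru2"): 0.1}
smt = {("bg1", "ru1"): 0.7, ("bg2", "ru2"): 0.3}
combined = combine_by_average(web, smt)
print(rank_false_friend_candidates(combined))
# [('bg2', 'ru2'), ('bg1', 'ru1')]
```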



Algorithm           11-pt average precision
------------------------------------------
BASELINE             4.17%
E2                  38.60%
E1                  39.50%
COOC                43.81%
COOC+L              53.20%
COOC+SMT+L          56.22%
WEB+COOC+L          61.28%
WEB+COOC+SMT+L      61.67%
WEB+L               63.68%
E1+L                63.98%
E1+SMT+L            65.36%
E2+L                66.82%
WEB+SMT+L           69.88%
E2+SMT+L            70.62%
WEB+E2+L            76.15%
WEB+E1+SMT+L        76.35%
WEB+E1+L            77.50%
WEB+E2+SMT+L        78.24%



Table 1: Performance of the different methods sorted 
by 11-pt average precision (in %). 



6 Discussion 

Table 1 shows that all algorithms perform well above the
baseline. When used in isolation, E1 and E2 are the weakest
and score below the original formula F6 (the COOC algorithm)
of [31]. However, when combined with lemmatization, both E1
and E2 perform much better than F6: Bulgarian and Russian are
highly inflectional languages and thus applying lemmatization
is a must. Not surprisingly, the best results are obtained for
the combined approaches.

Overall, we observe significant improvements over the original
algorithm of [31], but the results are still not perfect.



7 Conclusions and future work 

We have presented and compared several algorithms 
for acquiring pairs of false friends from a sentence- 
aligned bi-text based on sentence-level co-occurrence 
statistics, word alignments, the Web as a corpus, 
lemmatization, and various combinations thereof. The 
experimental results show a significant improvement 
over [31]. 

There are many ways in which the results could be improved
even further. First, we would like to try other formulas for
measuring the semantic similarity using word-level occurrences
and co-occurrences in a bi-text. It would also be interesting
to try the contextual mapping vectors approach of [9]. We
could try using additional bi-texts to estimate more reliable
word alignments, and thus obtain better lexical probabilities.
Using non-parallel corpora as in [26] is another promising
direction for future work. The Web-based semantic similarity
calculations could be improved using syntactic relations
between the words as in [4]. Finally, we would like to try the
algorithm for other language pairs and to compare our results
directly with those of other researchers on the same datasets.